Common peak approach using mass spectrometry data sets for predicting the effects of anticancer drugs on breast cancer.

We propose a method for biomarker discovery from mass spectrometry data, improving the common peak approach developed by Fushiki et al. (BMC Bioinformatics, 7:358, 2006). The common peak method is a simple way to select the sensible peaks that are shared with many subjects among all detected peaks by combining a standard spectrum alignment and kernel density estimates. The key idea of our proposed method is to apply the common peak approach to each class label separately. Hence, the proposed method gains more informative peaks for predicting class labels, while minor peaks associated with specific subjects are deleted correctly. We used a SELDI-TOF MS data set from laser microdissected cancer tissues for predicting the treatment effects of neoadjuvant therapy using an anticancer drug on breast cancer patients. The AdaBoost algorithm is adopted for pattern recognition, based on the set of candidate peaks selected by the proposed method. The analysis gives good performance in the sense of test errors for classifying the class labels for a given feature vector of selected peak values.


Introduction
Recent technological innovation has brought us comprehensive methods for the analysis of protein expression profi le data, such as Surface-Enhanced Laser Desorption/Ionization Time of Flight (SELDI-TOF) or Matrix-Assisted Laser Desorption/Ionization (MALDI)-TOF mass spectrometry (MS). After initial research on the early detection of ovarian cancer (Petricoin et al. 2002), new methodologies for data analyses have been developed (Yasui et al. 2003;Tibshirani et al. 2004;Geurts et al. 2005;Meleth et al. 2005;Yu et al. 2005). As the TOF data has both amplitude variation and phase variation (Lin et al. 2005), there are ongoing discussions about analytical problems (Conrads et al. 2004;Lyons-Weiler et al. 2005). Briefl y, the issue of analytical approaches includes the alignment and detection of peaks, and the construction of classifi ers for phenotypes. In the methods of detecting peaks, searching local maxima with a local signal-to-noise ratio is the most often used method. However the methods using wavelet transformation (Qu et al. 2003, etc.) were used as well. Recently, Fushiki et al. (2006) proposed the common peak method identifying biomarkers from high-dimensional MS data. This idea is based on the observation that peaks shared with for only few subjects may be noise, whereas peaks shared with more subjects may be signifi cant.
In this paper, we investigated the performance of the common peak method. We applied the common peak method to each group of subjects by class label rather than to the entire group, as by Fushiki et al. (2006). We applied this proposed approach to a SELDI-TOF MS data set from samples of cancer tissue obtained by laser capture microdissection (LCM) before the patients received medication with anticancer drugs. We examined the data set to detect effective peaks and constructed a system for predicting the effects of neoadjuvant therapy with an anticancer drug on breast cancer patients. Neoadjuvant therapy is one of the breast-conserving therapies that uses anticancer drugs before surgery. After the cancer has been reduced in size, surgical excision can be applied. An establishment of prediction systems of the treatment effect of an anticancer drug (e.g. a system predicting a patient being responder or nonresponder) would improve decision making on each patient's best medication. Such prediction systems could prevent useless medication and could help in selecting appropriate personalized therapies. We used the data set for illustrative purposes.
The common peak method allows the use of continuous and discrete covariates for peak intensities. In our study, we also examined both types of covariates in the analyses of statistical testing for each selected common peak and in the construction of a prediction system by using AdaBoost (Freund and Schapire, 1997) which is known as one of the effi cient ensemble learning methods. We found that the classifi er learning from discrete covariates showed very high performance. We discuss below some practical problems of data preprocessing in the mass spectrometry data and provide a guideline for the treatment of these data.

Data Sets Used
In our study, tissues from 65 breast cancer patients were sampled between 2003 and 2004 from the Cancer Institute Hospital at the Japanese Foundation for Cancer Research (Tokyo, Japan). The samples met all the following eligibility criteria: size Ն 3.0 cm of invasive cancer; Stage IIA-IIIB; age Յ 70; bone marrow, liver and kidney functions were maintained (WBC Ն 4000/mm 3 , Plat. Ն 100,000/mm 3 , Hb Ն 10 g/dl, GOT/GPT Ͻ 60/70 U/I) and the patients had no other serious complications.
All patients received docetaxel 75 mg/m 2 four times weekly every three weeks as neoadjuvant chemotherapy before surgery. Fresh cancer tissue biopsies were taken before treatment and cancer cells were isolated by LCM. To ensure the high quality of the samples, LCM was used in proteomic analyses (Verma et al. 2001;Batorfi et al. 2003;Wulfkuhle et al. 2003;Krieg et al. 2005). The treatment effect was judged from the pathology of specimens removed at subsequent surgery. Fortytwo patients were classed into a nonresponding group with pathology grade Յ1a (mild response). The other 23 patients with pathology grade Ն1b (moderate response) were classed as a group that responded to treatment. Here the pathological response 1a (mild response) is defi ned as mild changes in cancer cells regardless of the area, or marked changes seen in less than one third of cancer cells, and 1b (moderate response) is defi ned as marked changes in one third or more but less than two thirds of tumor cells (The Japanese Breast Cancer Society, 2005). The total set of 65 patients was randomly separated into 50 training samples and 15 test samples, as shown in Table 1.

Preprocessing
In MS data, the x-axis shown here denotes the time of fl ight that was transformed into the mass-tocharge ratio (m/z value), and the y-axis denotes the intensity. Procedures for preprocessing observed spectra were as follows: (1) baseline subtraction, (2) alignment of mass spectra and (3) normalization. We used SpecAlign software (Wong et al. 2005) for procedures (1) and (2). Generation of spectrum averages and alignment using the peak matching method were performed in procedure (2). Normalization was performed using the method of Baggerly et al. (2004). For a single spectrum, let V i denote the raw intensity at the i-th m/z value, and let V min and V max denote the smallest and largest observed intensities in the spectrum, respectively. Then the normalized intensity NV i is given by and all NV i 's are in the [0, 1] range.

Common peak method
To identify proteins associated with phenotypes, we needed to discriminate between noise and peaks of protein expression in observed spectra. Fushiki et al. (2006) proposed a new method for peak detection using published data. The principle of this method is that peaks may come from protein expression rather than noise when peaks are commonly observed in a major portion of subjects. Using this method, covariates for prediction were constructed in the following order: 1. Peak detection for each subject 2. Common peak detection among subjects 3. Calculation of discrete and continuous covariates by each subject.

Peak detection for each subject
First, we detected peaks for each subject, using the method proposed by Yasui et al. (2003). We set a k-nearest neighborhood as a width of window on the x-axis. Here we used k = 10. An m/z value achieving the maximum intensity in that window is regarded as a peak. We then moved the window along the x-axis and search for peaks in the spectrum of each subject.

Common peak detection among subjects
Next, we constructed common peaks by a responder and a nonresponder group from each individual's peaks. In this step, we calculated an average of peaks, A(x), constructed by averaging Gaussian kernels with centers at the individual peak. Then A(x) was expressed as (1) where N G is the sample size of each group, p i,j is the m/z value of the i-th subject's j-th peak and σ is a parameter accounting the width of the peak. We used here σ = 0.001. The common peak is defi ned by the point x in which A(x) is greater than a certain threshold h. Figure 1 shows the curve of A(x) at x = [3000, 4000] from nonresponders. Here we used h = 0.5. Fushiki et al. (2006) obtained common peaks using all subjects of a study at once, however, in our work we applied the common peak method separately to each of two groups, as our purpose was to detect informative peaks for discrimination between responder and nonresponder. The feature of our approach is that it uses information of labels of groups but it does not intend any discrimination for a particular common peak. If the same common peak is selected for both groups, it would not help in discrimination. However, when a common peak is detected only in one group, then that peak would be an appropriate candidate for classifi ers. Below we will compare the proposed method with that by Fushiki et al. (2006).

Calculation of discrete and continuous covariate by each subject
We often analyze data sets with discrete covariates, which are dichotomous codes with 0 and 1 rather than direct intensity when there might be a relative large error of intensity of SELDI and MALDI. In this case, a covariate x j for the common peak m j for the j-th peak for a certain subject might be obtained as follows: ( 2) where ρ adjusts a width of the window around the common peak m j , and should be set according to precision of m/z values. On the other hand, when we aim to use the observed value of intensity, the continuous covariates x j is defi ned as follows: In this study, we employed both cases of discrete and continuous covariates with ρ = 0.0005.

AdaBoost
AdaBoost is one of the machine learning algorithms which is ensembles of statistical classifi ers that are more accurate than a single classifi er. It is known that the boosting algorithm is highly resistant to overfi tting in the discovery of protein biomarkers (Yasui et al. 2003).
We consider here a set of the training data set D = {(x i , y i ):i = 1, … , N}, where x is an input vector and y ∈ {+1, −1} is a class label. In this paper x corresponds to a set of covariates based on the common peaks. Let be a set of weak classifi ers. Here J is the total number of common peaks among groups. Then the AdaBoost algorithm is described as follows: 1. Set an initial value of weight w 1 (i) = , (i = 1, … , N). 2. Defi ne a weighted error rate for t-th iteration, t = 1, … , T, as where I represents indicator function and w t is a weight at t-th iteration. Next, Here we normalize the weight such that 3. where In step (2a), f t is adjusted for a restriction that the weighted error rate must be less than 0.5. If it exceeds 0.5, then we use −f t instead of f t as a classifi er. Furthermore, step (2c) can be expressed as: We update the weight by multiplying with exp(α t ) when f t misjudges the i-th subject, and by multiplying with exp (−α t ) when f t judges correctly the i-th subject (see Murata et al. 2004). We adopted cross-validation (CV) for selecting the number of classifi ers T as follows: We select the minimum of T such that the integrated classifi er f T in step 3 attains local minima and has CV errors with no more than one standard error above the minimum CV error (see Hastie et al. 2001).

Common peak detection
From the training data set, we obtained 92 common peaks for the responder group of 18 patients and 81 common peaks for the nonresponder group of 32 patients. All common peaks which were detected for at least one group were used for analysis. In total, 117 common peaks were obtained. We calcu- lated both discrete and continuous covariates for these 117 common peaks.

Construction of classifi ers
To construct a classifier, we analyzed the training data set using AdaBoost and computed the training and CV errors. CV error was calculated by replicating a five-fold cross-validation 50 times and averaging the errors. Figure 2(a) shows the error curves of the discrete case. The CV error (dashed line) was minimized locally at T = 6 and the error rate at T = 6 did not differ statistically significant from that of the best model (T = 15). Therefore, we selected the sixpeaks model for the discrete case. The error curves of the continuous case with normalization are shown in Figure 2(b), but the CV error rates for entire range of T were much worse than that for the discrete covariates. Therefore there were not any comparable model for the continuous case.

Validation result
Using the six-peaks model, we predicted treatment effects, (i.e. "responder or nonresponder"), for each subject in the test data of 15 subjects. The test error was 1/15 for the discrete covariates ( Fig. 2(a)). Figure 3 shows the prediction scores for all subjects of the test data using discrete covariates. The prediction score F ′(x) is given by (5) This score has the property that when the score is more distant from the zero value, the prediction is more confi dent.

Single peak analysis
For selecting six peaks by AdaBoost, we performed Fisher's exact test to investigate whether the frequency of the peak differed between groups. Table 2 indicates that the result was signifi cant for all six peaks.

Comparison with the original common peak method
To compare it with the original common peak method, we analyzed the same data set using the unsupervised method of Fushiki et al. (2006). This method detected 81 common peaks from 50 samples of the training data set, but there were no differences in the frequencies of peaks between responders and nonresponders. AdaBoost was also applied but the training error rate could not be forced to become zero. It was not possible to discriminate the training data set suffi ciently when using the original common peak method. From this we conclude that the new method is superior for our task.

Comparison with another classification approach
It is important to compare this method with other classifi cation methods such as the support vector machine-recursive feature elimination (SVM-RFE) approach (Guyon et al. 2002). We re-analyzed our data using the SVM-RFE with polynomial and Gaussian kernels instead of AdaBoost. In the discrete case, the fi fteen peaks model with 2nd degree polynomial kernel was selected by the RFE. The fi ve-fold CV error of this model was 0.024 and the test error was 2/15. Only three peaks (m/z values 1361, 2250 and 2989) selected by AdaBoost were included in the fi fteen peaks model. The SVM-RFE approach gave comparable results, but we concluded AdaBoost was better for our analysis in that the number of peaks in the model was small and the test error was small.

Discussion
In the common peak method proposed here, one has to set four parameters (k, h, ρ, σ) to select common peaks and give the individual covariates. If the parameter k concerning window width is small, as the probability of selecting false positives is high and hence the baseline of average peaks is also high. We adopted k = 10 following the original method of Fushiki et al. (2006); we also tried k = 20, but the resulting common peaks showed no difference. The threshold parameter h was used for detection of the common peaks.
When the sample size N G is small in equation (1), the impact of uncommon peaks on A(x) is large. Therefore in such a case h should also be large. Parameters ρ and σ, should be set to properly account for the width of the peak, because it is diffi cult to align spectra perfectly in the stage of preprocessing. SELDI-TOF machine we used has an error of 0.002 about its m/z values, and ρ and σ should be less than 0.002 because the spectrum . Prediction scores of the test data for discrete covariates using six-peaks model. The "○" indicate the fi ve responders and the "×" indicate the ten nonresponders. Only one of the subjects with response "1b" is misclassifi ed.
alignment had been performed already. In our study, we set ρ = 0.0005 and σ = 0.001 for this reason.
In our study, we realized that the result for continuous covariates was worse than that for discrete covariates. Continuous covariates are calculated based on the intensity with baseline subtraction. Hence they become sensitive to the variation caused by the methods or parameter settings of baseline subtraction. In contrast, discrete covariates are not infl uenced by baseline subtraction. In this sense, we supported the results based on the case of discrete covariates in this analysis.
On the use of the common peak method, if the spectra alignment is insufficient, the number of detected common peaks becomes smaller, and consequently it may fail to detect important peaks. Therefore, preprocessing-especially for spectra alignment-is important when analyzing MS data.
It is important to compare the common peak method with other peak selection methods. We analyzed our data using the Ciphergen ProteinChip software (Fung and Enderwick, 2002) with default settings, 82 peaks of 117 peaks obtained by our method were detected from 50 training samples, and only three peaks (m/z values 1361, 2250 and 2621) were detected of the six peaks we selected as the biomarkers. ProteinChip software judges whether an intensity is a peak caused by protein expression or noise for each single spectrum. Then peak alignment after the peak detection is needed for classification. On the other hand, the common peak method detects peaks from multiple spectra, and hence our method can detect peaks more sensitively.
In order to see how the results would be affected if different training and test sets were used, we examined randomly splitting the spectra into training and test 10 times. The mean of the number of selected peaks was 7.4 and its range was 4 to 12. Among the six peaks selected in Section 4.2, the peak with m/z value 2843 was selected in all 10 cases, and the other peaks were selected at least three cases. The mean test error was 0.087, and this error rate had little difference from 1/15(=0.067). Therefore it was confi rmed that our result did not depend on the choice to split training and test sets.
We compared the proposed method with the SVM-RFE method in Section 4.6 and concluded that AdaBoost was slightly better than the SVM-RFE in our analysis, but there was no signifi cant difference. However, the SVM-RFE method has more computational complexity, because the SVM-RFE learns using all variables fi rst to compute the ranking criterion for all variables, and removes the least important variable in a sequential manner. On the other hand, AdaBoost minimizes sequentially the exponential loss function and selects important variables simultaneously. AdaBoost has an advantageous point from the computational reason with an appropriate stopping rule.
Validating common peaks obtained here on other available breast cancer data sets is important. Pusztai et al. (2004) used SELDI-TOF MS profi ling to examine proteomic changes in plasma of patients with breast carcinoma who received either preoperative or postoperative chemotherapy for Stage I-III breast carcinoma. They detected only one treatment-induced protein/peptide peak (m/z value 2790) and reported fi ve peaks (m/z values 3165, 3440, 4115, 4444, and 8940) that expressed in plasma obtained from women breast carcinoma. Among the common peaks obtained in our study, one peak with m/z value 3444 was close to one of the peaks (m/z value 3440) reported by Pusztai et al. (2004). This peak was observed in both case and control groups as a common peak. However, it is hard to confi rm the consistency, because there are many differences between two studies compared. We used microdisected cancer tissue, but Pusztai et al. (2004) used plasma and they used paclitaxel chemotherapy or 5-fl uorouracil, doxorubicin, and cyclophosphamide (FAC) chemotherapy. Therefore, further examination will be needed to validate these peaks and biomarkers obtained here.